The Used Car Price Prediction dataset contains 4,009 vehicle listings collected from the automotive marketplace cars.com. Each row represents one listing and is described by 12 columns covering pricing and vehicle characteristics. The dataset is taken from Kaggle: https://www.kaggle.com/datasets/taeefnajib/used-car-price-prediction-dataset

The dataset provides information on:

- Brand and model – manufacturer and specific vehicle model
- Model year – age of the car, influencing depreciation
- Mileage – an indicator of usage and wear
- Fuel type – e.g., gasoline, diesel, electric, hybrid
- Engine type – performance and efficiency characteristics
- Transmission – automatic or manual
- Exterior/interior colors – aesthetic properties
- Accident history – whether the car has previously been damaged
- Clean title – legal/ownership status
- Price – listed price of the vehicle

Overall, the dataset offers a structured overview of key features that influence used car valuation. It is well-suited for analytical tasks such as understanding pricing drivers, exploring consumer preferences, and building predictive models for vehicle prices.

# Raw data

We load the original CSV directly from the project data folder using `here()` so paths work regardless of the working directory.

```r
library(here)  # here() builds paths relative to the project root

raw_path <- here("data", "raw", "used_cars.csv")
cars_raw <- readr::read_csv(raw_path, show_col_types = FALSE)
```

Basic structure of the raw dataset (note that `milage` and `price` are stored as text and need cleaning before analysis):

```r
dplyr::glimpse(cars_raw)
```

```
## Rows: 4,009
## Columns: 12
## $ brand <chr> "Ford", "Hyundai", "Lexus", "INFINITI", "Audi", "Acura", …
## $ model <chr> "Utility Police Interceptor Base", "Palisade SEL", "RX 35…
## $ model_year <dbl> 2013, 2021, 2022, 2015, 2021, 2016, 2017, 2001, 2021, 202…
## $ milage <chr> "51,000 mi.", "34,742 mi.", "22,372 mi.", "88,900 mi.", "…
## $ fuel_type <chr> "E85 Flex Fuel", "Gasoline", "Gasoline", "Hybrid", "Gasol…
## $ engine <chr> "300.0HP 3.7L V6 Cylinder Engine Flex Fuel Capability", "…
## $ transmission <chr> "6-Speed A/T", "8-Speed Automatic", "Automatic", "7-Speed…
## $ ext_col <chr> "Black", "Moonlight Cloud", "Blue", "Black", "Glacier Whi…
## $ int_col <chr> "Black", "Gray", "Black", "Black", "Black", "Ebony.", "Bl…
## $ accident <chr> "At least 1 accident or damage reported", "At least 1 acc…
## $ clean_title <chr> "Yes", "Yes", NA, "Yes", NA, NA, "Yes", "Yes", "Yes", "Ye…
## $ price <chr> "$10,300", "$38,005", "$54,598", "$15,500", "$34,999", "$…
```

We base the EDA on the engineered dataset (`data/processed/used_cars_features.csv`) that keeps cleaned numeric fields and derived features like age, mileage in thousands, and accident flags.

```r
# The processed file is semicolon-delimited, hence read_delim() with delim = ";"
features_path <- here("data", "processed", "used_cars_features.csv")
cars <- readr::read_delim(features_path, delim = ";", show_col_types = FALSE)
```

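
The feature-engineering script itself is not shown here, so the following is only a minimal sketch of how the text columns from the raw file could be turned into the numeric features named above. The helper `digits_only()`, the toy data frame, the column name `accident_flag`, and the reference year `ref_year` are all assumptions made for illustration, not the project's actual pipeline.

```r
# Toy rows in the raw format shown by glimpse(); values are illustrative only.
toy_raw <- data.frame(
  model_year = c(2013, 2021),
  milage     = c("51,000 mi.", "34,742 mi."),
  accident   = c("At least 1 accident or damage reported", "None reported"),
  price      = c("$10,300", "$38,005"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep digits only, then convert ("$10,300" -> 10300).
digits_only <- function(x) as.numeric(gsub("[^0-9]", "", x))

# ASSUMPTION: the reference year used to compute age is not stated in the report.
ref_year <- 2024

toy_features <- data.frame(
  price_dollar  = digits_only(toy_raw$price),
  milage_k      = digits_only(toy_raw$milage) / 1000,  # mileage in thousands
  age           = ref_year - toy_raw$model_year,
  accident_flag = toy_raw$accident != "None reported"  # TRUE if any report
)

# Natural log of price, matching the scale of log_price in the summary table.
toy_features$log_price <- log(toy_features$price_dollar)

print(toy_features)
```

A digits-only approach is enough for whole-number fields such as price and mileage; fields with decimals (for example horsepower embedded in the `engine` string) would need a more careful pattern.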
| variable | median | mean | p25 | p75 | sd | min | max |
|---|---|---|---|---|---|---|---|
| price_dollar | 28000.00 | 36865.68 | 15500.00 | 46999.00 | 36531.16 | 2000.0 | 649999.00 |
| log_price | 10.24 | 10.19 | 9.65 | 10.76 | 0.82 | 7.6 | 13.38 |
| age | 9.00 | 10.32 | 6.00 | 14.00 | 5.87 | 1.0 | 29.00 |
| milage_k | 63.00 | 72.14 | 30.00 | 103.00 | 53.60 | 0.0 | 405.00 |
| horsepower | 310.00 | 331.51 | 248.00 | 400.00 | 120.32 | 76.0 | 1020.00 |
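
The code that produced the numeric summary above is not included in the report. As a rough sketch, the same seven statistics can be computed per column with base R. The function name `summarise_numeric()` is hypothetical, and the toy values are borrowed from the table's own quantiles rather than from real rows.

```r
# Hypothetical helper: the seven statistics shown in the summary table.
summarise_numeric <- function(x) {
  x <- x[!is.na(x)]
  c(
    median = median(x),
    mean   = mean(x),
    p25    = unname(quantile(x, 0.25)),
    p75    = unname(quantile(x, 0.75)),
    sd     = sd(x),
    min    = min(x),
    max    = max(x)
  )
}

# Toy data only: five values per column, not the real dataset.
toy <- data.frame(
  price_dollar = c(2000, 15500, 28000, 46999, 649999),
  age          = c(1, 6, 9, 14, 29)
)

# One row per variable, one column per statistic.
summary_tbl <- t(sapply(toy, summarise_numeric))
print(round(summary_tbl, 2))
```
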
| accident | n | share |
|---|---|---|
| At least 1 accident or damage reported | 871 | 0.28 |
| None reported | 2194 | 0.72 |
The median listing price is about $28k, with the middle 50% between roughly $15.5k and $47k. The maximum reaches about $650k and the mean ($36.9k) sits well above the median, which reflects the heavy right tail. Median age is 9 years (IQR: 6–14), median mileage is about 63k miles (IQR: 30k–103k), and median horsepower is 310 HP (IQR: 248–400). About 28% of listings (871 of the 3,065 tabulated) report an accident or damage, a meaningful factor for pricing.
Raw prices are extremely right-skewed, with most listings below $80k but a long tail of luxury and exotic vehicles. Modeling on this scale would be dominated by a few high-price outliers.
Log transformation produces a more bell-shaped distribution and stabilizes variance, making linear-style models and visual comparisons more reliable.
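
To illustrate the effect of the log transformation, the sketch below compares sample skewness before and after taking the natural log. It uses simulated log-normal prices, not the actual listings; the parameters are chosen only to loosely resemble the `log_price` mean and standard deviation in the summary table, and the `skewness()` helper is defined here for illustration.

```r
# Hypothetical helper: moment-based sample skewness.
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

set.seed(42)
# SYNTHETIC prices: log-normal draws, loosely matched to the summary table.
sim_price <- rlnorm(1000, meanlog = 10.2, sdlog = 0.8)
sim_log_price <- log(sim_price)

skew_raw <- skewness(sim_price)      # strongly positive: long right tail
skew_log <- skewness(sim_log_price)  # close to zero: roughly symmetric

cat("skewness, raw price:", round(skew_raw, 2), "\n")
cat("skewness, log price:", round(skew_log, 2), "\n")
```

Predictions made on the log scale can be mapped back to dollars with `exp()`; for example, the median `log_price` of 10.24 corresponds to roughly $28k.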
Prices decline with age across fuel types. Electric listings start high but show the sharpest early drop; diesel holds comparatively high prices across ages (though the diesel sample is small), and gasoline sits lower overall.
Among the 12 most common brands, Porsche leads on median price, followed by Land Rover and Mercedes-Benz. Volume brands (Toyota, Nissan, Jeep) cluster lower with tighter spreads, while some (Chevrolet, Ford) show wider price ranges because they span broader lineups.
Higher mileage correlates with lower prices. We use a loess smoother (not a straight trendline) and cap the x-axis at 250k miles to reduce the influence of extreme outliers; automatics show a steady decline, and the smaller manual subset is noisier but similar in direction.
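
The plotting code is not shown, so here is a minimal sketch of the smoothing idea using base R's `loess()` on simulated data. Two assumptions are made: the cap at 250k miles is applied by filtering rows before fitting (the original may only have limited the axis), and the response is the log price. The data, the span, and the prediction grid are all illustrative.

```r
set.seed(1)
n <- 300
# SYNTHETIC data: log price falls with mileage, plus noise.
toy <- data.frame(milage_k = runif(n, 0, 400))
toy$log_price <- 11 - 0.006 * toy$milage_k + rnorm(n, sd = 0.4)

# ASSUMPTION: drop listings above 250k miles before fitting the smoother.
toy_capped <- subset(toy, milage_k <= 250)

# Local regression instead of a single straight line.
fit <- loess(log_price ~ milage_k, data = toy_capped, span = 0.75)

# Evaluate the smooth curve on a grid inside the observed range
# (loess does not extrapolate beyond the data by default).
grid <- data.frame(milage_k = seq(25, 225, by = 50))
grid$fitted <- predict(fit, newdata = grid)
print(grid)
```

Fitting the same model separately for each transmission type would give one curve per group, which is how the automatic and manual trends could be compared.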
Cars with reported accidents are listed at a clear discount relative to cars with no reported accidents, even after log-scaling prices, which supports treating accident history as an important predictor.
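
A gap between group medians on the log scale can be converted into an approximate percentage discount with `exp(gap) - 1`. The sketch below shows the arithmetic on invented toy prices; it does not reproduce the discount in the actual data, which the report does not quantify.

```r
# Toy data only: six invented listings, three per accident status.
toy <- data.frame(
  accident = rep(
    c("None reported", "At least 1 accident or damage reported"),
    each = 3
  ),
  price_dollar = c(20000, 30000, 40000, 18000, 24000, 27000)
)
toy$log_price <- log(toy$price_dollar)

# Median log price per group.
med <- tapply(toy$log_price, toy$accident, median)

# Difference of log medians, then back-transform to a percentage.
gap <- unname(med["At least 1 accident or damage reported"] -
              med["None reported"])
pct_discount <- (exp(gap) - 1) * 100

cat("median price difference:", round(pct_discount, 1), "%\n")
```

Because the median commutes with the log transformation for an odd number of values, this is the same as comparing the two dollar medians directly (24,000 vs 30,000 in the toy data).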